AITopics

Watanabe, Chihiro, Sun, Jingyu

Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis

arXiv.org Machine Learning, Feb-19-2026

Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs using their distances and the $k$-medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.
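The ECDF-and-clustering idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which ECDF distance is used, so the mean absolute difference on a shared grid is an assumed choice, and the function names, the fixed medoid initialization, and the toy similarity values are all hypothetical.

```python
import math
from bisect import bisect_right


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (non-zero, same length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ecdf(samples, grid):
    """Empirical CDF: fraction of samples <= t, for each t in grid."""
    ordered = sorted(samples)
    n = len(ordered)
    return [bisect_right(ordered, t) / n for t in grid]


def ecdf_distance(f, g):
    """Assumed distance: mean absolute difference of two ECDFs on one grid."""
    return sum(abs(a - b) for a, b in zip(f, g)) / len(f)


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix.

    Initialization with the first k items is an assumption made to keep
    the sketch deterministic.
    """
    n = len(dist)
    medoids = list(range(k))

    def assign():
        # Each item joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign()
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes total distance to its cluster members.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign()


# Toy data: cosine similarities to the reference answer for four
# hypothetical agent settings (values invented for illustration).
similarities = [
    [0.90, 0.85, 0.95, 0.88],
    [0.20, 0.30, 0.25, 0.35],
    [0.92, 0.80, 0.91, 0.87],
    [0.22, 0.28, 0.31, 0.40],
]
grid = [i / 20 for i in range(21)]
curves = [ecdf(s, grid) for s in similarities]
dist = [[ecdf_distance(f, g) for g in curves] for f in curves]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

In this toy run the two high-similarity settings share one cluster and the two low-similarity settings share the other, which is the kind of group structure the abstract describes.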

large language model, machine learning, natural language, (15 more...)

2602.16131

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Neural Information Processing Systems, Feb-18-2026, 00:22:22 GMT

f6b22ac37beb5da61efd4882082c9ecd-Paper-Conference.pdf

experience memory, large language model, machine learning, (19 more...)

Country:

Asia > China > Shanghai > Shanghai (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
North America > United States > New York > Richmond County > New York City (0.04)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.78)

Neural Information Processing Systems, Feb-17-2026, 16:33:45 GMT

b6e9d6f4f3428cd5f3f9e9bbae2cab10-Paper-Conference.pdf

large language model, machine learning, natural language, (18 more...)

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing Systems, Feb-16-2026, 13:17:13 GMT

8cea78701eb986f3ec357eb9b7c6badd-Paper-Conference.pdf

large language model, machine learning, natural language, (21 more...)

Country:

North America > United States (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Leisure & Entertainment > Games > Computer Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Game Theory (0.93)

arXiv.org Artificial Intelligence, Dec-10-2025

Using LLMs in Generating Design Rationale for Software Architecture Decisions

Zhou, Xiyu, Li, Ruiyin, Liang, Peng, Zhang, Beiqi, Shahin, Mojtaba, Li, Zengyang, Yang, Chen

Design Rationale (DR) for software architecture decisions refers to the reasoning underlying architectural choices, which provides valuable insights into the different phases of the architecting process throughout software development. However, in practice, DR is often inadequately documented due to a lack of motivation and effort from developers. With the recent advancements in Large Language Models (LLMs), their capabilities in text comprehension, reasoning, and generation may enable the generation and recovery of DR for architecture decisions. In this study, we evaluated the performance of LLMs in generating DR for architecture decisions. First, we collected 50 Stack Overflow (SO) posts, 25 GitHub issues, and 25 GitHub discussions related to architecture decisions to construct a dataset of 100 architecture-related problems. Then, we selected five LLMs to generate DR for the architecture decisions with three prompting strategies, including zero-shot, chain of thought (CoT), and LLM-based agents. With the DR provided by human experts as ground truth, the Precision of LLM-generated DR with the three prompting strategies ranges from 0.267 to 0.278, Recall from 0.627 to 0.715, and F1-score from 0.351 to 0.389. Additionally, 64.45% to 69.42% of the arguments of DR not mentioned by human experts are also helpful, 4.12% to 4.87% of the arguments have uncertain correctness, and 1.59% to 3.24% of the arguments are potentially misleading. To further understand the trustworthiness and applicability of LLM-generated DR in practice, we conducted semi-structured interviews with six practitioners. Based on the experimental and interview results, we discussed the pros and cons of the three prompting strategies, the strengths and limitations of LLM-generated DR, and the implications for the practical use of LLM-generated DR.
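For readers unfamiliar with the metrics quoted above, the following sketch shows how Precision, Recall, and F1-score relate when generated arguments are compared with an expert ground truth. It assumes exact string matching between arguments, which is a simplification; the study's actual matching procedure is not described in the abstract, and the function name and example arguments are hypothetical.

```python
def precision_recall_f1(generated, reference):
    """Set-based precision, recall, and F1 for generated vs. reference arguments.

    Assumption: two arguments match only if their strings are identical.
    """
    generated, reference = set(generated), set(reference)
    matched = len(generated & reference)
    precision = matched / len(generated) if generated else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: the model proposes four arguments, the expert lists three.
generated = {"scalability", "cost", "latency", "team familiarity"}
reference = {"scalability", "latency", "security"}
p, r, f1 = precision_recall_f1(generated, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A pattern of low Precision with higher Recall, as in the reported ranges, corresponds to a model that covers most expert arguments while also producing many arguments the experts did not list.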

large language model, machine learning, natural language, (18 more...)

2504.20781

Country: Asia > China > Hubei Province (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Dec-5-2025

MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications

Zeppieri, Stefano

Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.

artificial intelligence, large language model, natural language, (18 more...)

2512.0171

Genre: Research Report (0.40)

Industry:

Information Technology > Security & Privacy (0.93)
Health & Medicine (0.71)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence, Dec-4-2025

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Smith, Chandler, Abdulhai, Marwa, Diaz, Manfred, Tesic, Marko, Trivedi, Rakshit S., Vezhnevets, Alexander Sasha, Hammond, Lewis, Clifton, Jesse, Chang, Minsuk, Duéñez-Guzmán, Edgar A., Agapiou, John P., Matyas, Jayd, Karmon, Danny, Kundu, Akash, Korshuk, Aliaksei, Ananya, Ananya, Rahman, Arrasy, Kulandaivel, Avinaash Anand, McHale, Bain, Zhang, Beining, Alexander, Buyantuev, Rojas, Carlos Saith Rodriguez, Wang, Caroline, Talele, Chetan, Liu, Chenao, Lin, Chichen, Riazi, Diana, Shi, Di Yang, Tewolde, Emanuel, Tennant, Elizaveta, Zhong, Fangwei, Cui, Fuyang, Zhao, Gang, Piqueras, Gema Parreño, Yun, Hyeonggeun, Makarov, Ilya, Cui, Jiaxun, Purbey, Jebish, Dilkes, Jim, Nguyen, Jord, Xiao, Lingyun, Giraldo, Luis Felipe, Chacon-Chamorro, Manuela, Beltran, Manuel Sebastian Rios, Segura, Marta Emili García, Wang, Mengmeng, Alim, Mogtaba, Quijano, Nicanor, Schiavone, Nico, Macmillan-Scott, Olivia, Peña, Oswaldo, Stone, Peter, Kadiyala, Ram Mohan Rao, Fernandez, Rolando, Manrique, Ruben, Lu, Sunjia, McIlraith, Sheila A., Dhuri, Shamika, Shi, Shuqing, Gupta, Siddhant, Sarangi, Sneheel, Subramanian, Sriram Ganapathi, Cha, Taehun, Klassen, Toryn Q., Tu, Wenming, Fan, Weijian, Ruiyang, Wu, Feng, Xue, Du, Yali, Liu, Yang, Wang, Yiding, Kang, Yipeng, Sung, Yoonchang, Chen, Yuxuan, Zhang, Zhaowei, Wang, Zhihan, Wu, Zhiqiang, Chen, Ziang, Zheng, Zilong, Jia, Zixia, Wang, Ziyan, Hadfield-Menell, Dylan, Jaques, Natasha, Baarslag, Tim, Hernandez-Orallo, Jose, Leibo, Joel Z.

Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

artificial intelligence, large language model, natural language, (15 more...)

2512.03318

Country:

North America > United States (0.46)
Asia > China (0.28)
Europe > United Kingdom > England (0.28)
North America > Canada > Ontario > Toronto (0.14)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.87)

Industry: Leisure & Entertainment > Games > Computer Games (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence, Dec-2-2025

Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation

Zhang, Ke, Zhao, Xiaoning, Zheng, Ce, Ning, Jiahong, Zhu, Dandan, Zhang, Wenqi, Sun, Chen, Sugawara, Toshiharu

This study proposes Tool-RoCo, a novel benchmark for evaluating large language models (LLMs) in long-term multi-agent cooperation based on RoCo, a multi-robot cooperative benchmark. Recent research on LLM-based multi-agent systems has relied on predefined orchestration, while ignoring agent autonomy. Tool-RoCo treats other agents as tools and introduces cooperative tools, leveraging tool usage to evaluate multi-agent cooperation and self-organization. Tool usage means that each agent (LLM) selects a tool from a candidate set based on the current state, receives feedback, and adjusts its selection in subsequent rounds. To evaluate different autonomy levels, we propose four LLM paradigms: (1) centralized cooperation, where a single LLM allocates tools to all agents; (2) centralized self-organization, where a central LLM autonomously activates agents while keeping others inactive; (3) decentralized cooperation, where each agent has its own LLM and calls tools based on local information; and (4) self-organization, where a randomly chosen initial agent can request collaboration, activating additional agents via tool calls. Tool-RoCo includes three multi-robot tasks, SORT, PACK, and CABINET, to measure format and parameter accuracy and agent coordination through tool usage. The results using several LLMs showed that cooperative tools accounted for only 7.09% of all tools, indicating that LLM-based agents rarely invoked others as assistants. Moreover, activation tools accounted for 96.42%, suggesting that current LLMs tend to maintain active agents while seldom deactivating them for adaptive coordination. Tool-RoCo provides a systematic benchmark to evaluate LLM autonomy and cooperation in multi-agent tasks.
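A usage share such as the cooperative-tool percentage quoted above is a ratio of call counts by tool category. The sketch below shows one way such a share could be computed from a call log; the log format, the category labels, and the function name are assumptions for illustration and are not taken from the benchmark itself.

```python
from collections import Counter


def tool_usage_share(calls):
    """Return the fraction of calls per tool category.

    `calls` is a hypothetical log of (agent_name, tool_category) pairs.
    """
    counts = Counter(category for _, category in calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Invented log: one cooperative call out of four tool calls.
log = [
    ("robot_a", "cooperative"),
    ("robot_a", "task"),
    ("robot_b", "task"),
    ("robot_b", "task"),
]
shares = tool_usage_share(log)
print({category: f"{share:.2%}" for category, share in shares.items()})
```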

large language model, machine learning, natural language, (18 more...)

2511.2151

Country:

Asia > China (0.28)
Asia > Japan (0.28)

Genre: Research Report > New Finding (0.34)

arXiv.org Artificial Intelligence, Nov-25-2025

Large Language Model-based Data Science Agent: A Survey

Chen, Ke, Wang, Peiran, Yu, Yaoning, Zhan, Xianyang, Wang, Haohan

The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comprehensive analysis of LLM-based agents designed for data science tasks, summarizing insights from recent studies. From the agent perspective, we discuss the key design principles, covering agent roles, execution, knowledge, and reflection methods. From the data science perspective, we identify key processes for LLM-based agents, including data preprocessing, model development, evaluation, and visualization. Our work offers two key contributions: (1) a comprehensive review of recent developments in applying LLM-based agents to data science tasks; (2) a dual-perspective framework that connects general agent design principles with the practical workflows in data science.

large language model, machine learning, natural language, (18 more...)

2508.02744

Country:

Asia (0.28)
North America > United States > Illinois (0.14)

Genre:

Workflow (1.00)
Overview (1.00)
Research Report > Experimental Study (0.92)

Industry:

Health & Medicine (1.00)
Information Technology (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)